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Abstract 

The constant removal of deleterious mutations by natural selection causes a reduction in neutral diversity and efficacy of 
selection at genetically linked sites (a process called Background Selection, BGS). Population genetic studies, however, often 
ignore BGS effects when investigating demographic events or the presence of other types of selection. To obtain a more 
realistic evolutionary expectation that incorporates the unavoidable consequences of deleterious mutations, we generated 
high-resolution landscapes of variation across the Drosophila melanogaster genome under a BGS scenario independent of 
polymorphism data. We find that BGS plays a significant role in shaping levels of variation across the entire genome, 
including long introns and intergenic regions distant from annotated genes. We also find that a very large percentage of the 
observed variation in diversity across autosomes can be explained by BGS alone, up to 70% across individual chromosome 
arms at 100-kb scale, thus indicating that BGS predictions can be used as baseline to infer additional types of selection and 
demographic events. This approach allows detecting several outlier regions with signal of recent adaptive events and 
selective sweeps. The use of a BGS baseline, however, is particularly appropriate to investigate the presence of balancing 
selection and our study exposes numerous genomic regions with the predicted signature of higher polymorphism than 
expected when a BGS context is taken into account. Importantly, we show that these conclusions are robust to the 
mutation and selection parameters of the BGS model. Finally, analyses of protein evolution together with previous 
comparisons of genetic maps between Drosophila species, suggest temporally variable recombination landscapes and, thus, 
local BGS effects that may differ between extant and past phases. Because genome-wide BGS and temporal changes in 
linkage effects can skew approaches to estimate demographic and selective events, future analyses should incorporate BGS 
predictions and capture local recombination variation across genomes and along lineages. 
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Introduction 

The causes of the variation observed within natural populations 
have been a long-standing question in evolutionary and genetic 
studies. Particular insight into these causes can be gained by 
analyzing the distribution of nucleotide diversity across genomes, 
where species- and population-specific parameters such as the 
number of individuals, environmental factors, or demography are 
constant. A number of population genetics models have been put 
forward to explain this intra-genomic variation in diversity, often 
including the predicted consequences that selection acting at a 
genomic site impinges on genetically linked sites, either neutral or 
under selection themselves (i.e., models of 'selection at linked sites'; 
[1-4] and references therein). Although there is general agreement 
that selection at linked sites can play a role shaping levels of 
variation, there is still intense debate and research on the selective 
nature of the mutations causing such effects (e.g., beneficial or 
deleterious) and whether the same causes can be applied to 
different species [4—6]. 

Strongly beneficial mutations rise rapidly to fixation and 
hitchhike adjacent sites, causing a fingerprint of reduced intra- 
specific variation around the selected site known as a 'selective 



sweep' (the HHss model; [2,3,7-1 1]). A qualitatively similar 
outcome can be generated by another model of selection and 
hitchhiking effects at genetically linked sites without requiring 
adaptive changes, just as a result of the continuous input of 
strongly deleterious mutations and their removal by natural 
selection (the background selection (BGS) model [1,4,12-16]). 
Both models also predict that the consequences of selection 
removing adjacent diversity diminish when genetic recombination 
increases, a general pattern that has been observed in many 
species when comparing genomic regions with high and low (or 
zero) recombination rates (reviewed in [1,3-5]). 

The magnitude and distribution of recombination rates across 
genomes play key roles in predicting the consequences of selection 
on adjacent variation. In humans, for instance, the presence of 
large recombination cold spots raised the possibility that BGS 
could reduce polymorphism levels at specific genomic regions. In 
agreement, recent analyses using models of purifying selection 
rather than purely neutral ones suggest that patterns of nucleotide 
diversity across the human genome are consistent with BGS 
predictions [17-21]. In the model system D. melanogaster, low- 
resolution recombination maps described limited or absent 
recombination near sub-telomeric and -centromeric regions 
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Author Summary 

The removal of deleterious mutations from natural 
populations has potential consequences on patterns of 
variation across genomes. Population genetic analyses, 
however, often assume that such effects are negligible 
across recombining regions of species like Drosophila. We 
use simple models of purifying selection and current 
knowledge of recombination rates and gene distribution 
across the genome to obtain a baseline of variation 
predicted by the constant input and removal of deleteri- 
ous mutations. We find that purifying selection alone can 
explain a major fraction of the observed variance in 
nucleotide diversity across the genome. The use of a 
baseline of variation predicted by linkage to deleterious 
mutations as null expectation exposes genomic regions 
under other selective regimes, including more regions 
showing the signature of balancing selection than would 
be evident when using traditional approaches. Our study 
also indicates that most, if not all, nucleotides across the D. 
melctnogaster genome are significantly influenced by the 
removal of deleterious mutations, even when located in 
the middle of highly recombining regions and distant from 
genes. Additionally, the study of rates of protein evolution 
confirms previous analyses suggesting that the recombi- 
nation landscape across the genome has changed in the 
recent history of D. melanogaster. All these reported 
factors can skew current analyses designed to capture 
demographic events or estimate the strength and 
frequency of adaptive mutations, and illustrate the need 
for new and more realistic theoretical and modeling 
approaches to study naturally occurring genetic variation. 

whereas recombination outside these sub-telomeric and -centro- 
meric regions (i.e., across trimmed chromosome arms) has been 
often assumed to be both high and homogeneously distributed. As 
a consequence, variation in nucleotide diversity across trimmed 
chromosome arms has been mostly attributed to positive selection 
and selective sweeps ([4,5,22-29]; but see [30]). 

There are, however, several reasons to believe that BGS effects 
could be significant in D. melanogaster as well. First, compared with 
humans, D. melanogaster has a more compact genome and a larger 
effective population size (j\Q, predicting tighter genetic linkage 
between genes and stronger purifying selection, respectively, and 
both factors forecast greater BGS effects. Second, recent whole- 
genome studies of recombination rates in D. melanogaster exposed 
extensive heterogeneity in the distribution of crossover rates even 
after removing sub-telomeric and centromeric regions [31]. This 
high degree of variation in recombination rates across D. 
melanogaster chromosomes is observed when recombination is 
obtained from a single cross of two specific strains [31,32] as well 
as when analyzing a species' average obtained from combining 
genetic maps from crosses of several natural strains [31]. The 
presence of coldspots of recombination embedded in chromosomal 
regions assumed to have high recombination rates, therefore, 
provides the opportunity for BGS to play a more significant role 
across broader genomic regions than previously anticipated [31]. 
Finally, Charlesworth [33] has recendy showed that BGS effects 
are predicted to be detectable in the middle of recombining 
chromosome arms in D. melanogaster. 

The consequences of BGS at a given nucleotide position in the 
genome (focal point) can be described by the predicted level of 
neutral nucleotide diversity when selection at linked sites is allowed 
(71) relative to the level of diversity under complete neutrality and 
free recombination between sites (k 0 ), with B = n/n 0 [12-15]. 



Therefore, B~ 1 would indicate negligible BGS effects whereas 
B« 1 would suggest very strong BGS and a substantial reduction 
in levels of neutral diversity. B can also be understood in terms of a 
reduction in N e , and variation in B forecasts differences in levels of 
diversity within species but also differences in the efficacy of 
selection, which can be approximated by the product of N e and the 
selection coefficient s. Note, however, that the prediction about 
reduced efficacy of selection is a qualitative one since there is no 
simple scalar transformation of N e influenced by selection at linked 
sites that allows estimating probabilities of fixation of selected 
mutations [34,35]. Thus, a comprehensive study of the predictive 
power of BGS to explain natural variation across genomes needs 
to show that, 1) conditions exists across a genome to generate 
significant overall effects reducing B, 2) B varies across the 
genome, and 3) regions with reduced B are associated with 
reduced levels of polymorphism and efficacy of selection (e.g., 
detectable on rates of protein evolution). 

Here, we investigated what is the fraction of the D. melanogaster 
genome that is influenced by BGS and how much of the observed 
variance in patterns of intra-specific variation and rates of 
evolution across this genome can be explained by BGS alone. 
Importandy, to obtain a sensible BGS baseline that could be used 
to test for positive selection and other departures from neutrality, 
we investigated BGS models that are purposely simple and 
independent of nucleotide variation data. Additionally, we studied 
whether our conclusions are sensitive to parameters of the BGS 
model. To this end, we expanded approaches previously applied to 
investigate human diversity [17,18] to now estimate BGS effects 
across the D. melanogaster genome. 

In all, we generated a detailed description of the consequences 
of purifying selection on linked sites at every 1 kb along D. 
melanogaster chromosomes under a variety of BGS models. Our 
results show that BGS likely plays a detectable role across the 
entire genome and that purifying selection alone can explain a 
very large fraction of the observed patterns of nucleotide diversity 
in this species. Notably, we show that these conclusions are robust 
to different parameters in the BGS models. The use of a BGS 
baseline also uncovers the presence of regions with the signature of 
a recent selective sweep and, less expected, numerous instances of 
balancing selection. Furthermore, analyses of rates of protein 
evolution suggest that the recombination landscape has changed 
recently along the D. melanogaster lineage thus generating disparity 
between short- and long-term JV e at many genomic positions. We 
discuss the advantages of incorporating BGS predictions across 
chromosomes and the potential consequences of temporal 
variation in recombination landscapes when estimating demo- 
graphic and selective events. 

Results 

BGS models 

BGS expectations (i.e., estimates of B) were obtained for every 
1-kb region across the whole genome as the cumulative effects 
caused by deleterious mutations at any other site along the same 
chromosome (see Materials and Methods for details). These 
estimates of B were based on BGS models that include our current 
knowledge of genome annotation at every nucleotide site of the 
genome and high-resolution recombination landscapes in D. 
melanogaster that distinguish between crossover and gene conversion 
rates [31]. These models also incorporate the possibility that 
strongly deleterious mutations occur at sites that alter amino acid 
composition as well as at a fraction of sites in noncoding 
sequences. The inclusion of deleterious mutations in noncoding 
sequences allows taking into account the existence of regulatory 
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and other non-translated functional sequences, either in introns 
and 5'- and 3'-flanking UTRs, or in intergenic regions [22,33,36- 
38] . For each category of selected sites (nonsynonymous, intronic, 
UTR, or intergenic) we used the proportion of constrained sites (cs) 
estimated for D. melanogaster [22,37,38] as the fraction of sites with 
deleterious fitness consequences when mutated [33]. In terms of 
recombination rates, we studied BGS predictions following the 
standard approach of including crossover as the sole source of 
recombination (hereafter models M GO ) and also when combining 
the effects of crossover and gene conversion events (models 
Mco+gc) to better quantify the true degree of linkage between 
sites in natural populations. 

The distribution of deleterious fitness effects (DDFE) and the 
diploid rate of deleterious mutations per generation (U) are 
parameters that have direct implications on estimates of B but are 
more difficult to establish experimentally. Although a gamma 
distribution has been proposed a number of times for deleterious 
mutations [39-44], a log-normal DDFE allows capturing the 
existence of lethal mutations and fits better D. melanogaster 
polymorphism data [45,46]. Additionally, a log-normal DDFE 
predicts a higher fraction of mutations with minimal consequences 
removing linked variation than a gamma DDFE and, ultimately, 
weaker BGS effects (see Materials and Methods). Therefore, the 
use of a log-normal DDFE can be taken as a conservative 
approach when inferring the magnitude of BGS effects. 

Direct estimations of deleterious mutation rates are still fairly 
limited. In D. melanogaster, initial analyses of mutation accumula- 
tion lines estimated a mutation rate for point mutations and small 
indels (a) of ~8.4xl0 9 /bp /generation and a diploid rate of 
deleterious mutations per generation (U) of ~ 1.2 [47]. Neverthe- 
less, one of the lines used in this study had an unusually high 
mutation rate [48] and more recent studies suggest h~4-5x10~ 9 
(c7~0.6) for point mutations and small indels [48-50]. These lower 
estimates, however, do not include the possible presence in natural 
populations of genotypes with high mutation rates or the 
deleterious consequences of transposable element (TE) insertions. 
In fact, TEs are very abundant in natural populations of D. 
melanogaster [51-60] and have been proposed to be an important 
source of BGS in this species [30]. Therefore, U~0.6 represents a 
lower boundary for the deleterious mutation rate when inferring 
the consequences of BGS. To include the consequences of TE 
insertion in our BGS models, we obtained an approximate diploid 
insertion rate of Ute— 0.6 based on a detailed description of TE 
distribution in D. melanogaster [60] and mutation-selection balance 
predictions (see Materials and Methods for details). Thus, a 
genome-wide diploid deleterious mutation rate of ~1.2 per 
generation is a reasonable approximation that captures the 
consequences of point mutations, small indels and the insertion 
of transposable elements. 

To assess how robust our results and conclusions are to the 
parameters of the BGS model, we obtained genome-wide 
landscapes for B under eight different models, with DDFE 
following a log-normal or a gamma distribution (models Mjjq 
and M G , respectively), with deleterious mutations rates that 
include or not TE insertions (models M StdMut and M LowMut , 
respectively), and with recombination taking into account cross- 
over and gene conversion events or only crossovers (models 
Mco+gc and M co , respectively). Unless specifically noted, we 
report results based on the BGS model that is most consistent with 
our current knowledge of gene distribution across the genome, a 
log-normal DDFE, a genome-wide diploid deleterious mutation 
rate of U= 1.2, and recombination rates that include crossover 
and gene conversion events (i.e., our default model is M L N,stdMut, 
co+gc)- Table SI summarizes the results from the BGS models 



and Table S2 provides the full distribution of B estimates across all 
chromosomes. 

Patterns of BGS across the D. melanogaster genome 

Genome-wide estimates of B show a median of 0.591 and 
indicate that the predicted influence of BGS across the D. 
melanogaster genome would reduce the overall N e substantially 
relative to levels predicted by evolutionary models with free 
recombination (see Figure 1A). The study of the distribution of B 
along chromosomes shows that the reduction in neutral diversity is 
severe in a large fraction of the genome, with 19% of all 1-kb 
regions with 5<0.25 and a lower 90% CI for B of 0.005 (Figure IB 
and Figure 2). Importandy, the distribution of B across trimmed 
chromosomes is also highly heterogeneous. As expected, estimates 
of B are strongly influenced by variation in local crossover rates (c), 
with a Spearman's rank correlation coefficients (p) between B and 
c of 0.792 for trimmed chromosomes. As shown in Figure 3, 
however, there is detectable variance in B for a given local c that 
exposes the additional effects of long-range distribution of 
recombination rates and genes when estimating B at a focal point. 

Median B across trimmed chromosome arms is 0.643, with a 
minimum estimate of 0.19. Significant and variable BGS effects 
are, therefore, expected in D. melanogaster not only due to sub- 
telomeric and -centromeric regions but also across trimmed 
chromosomes (see also [33]). This general conclusion does not 
vary qualitatively when considering BGS models with other 
parameters (Table SI). As expected, a model with a DDFE 
following a gamma distribution (Mq) predicts stronger BGS effects 
and lower estimates of B across the genome than when the DDFE 
follows a log-normal (as in our default model). Under model 
M g ,co+gc the median B is 0.428 and the lower 90% CI for B is 
0.001 (median B across trimmed chromosomes is 0.493, with a 
minimum estimate of 0.007). Also anticipated, models with a lower 
deleterious mutation rate (models M LowMut ) generate higher 
estimates of B than when TE insertions are taken into account 
(models M StdMut ). For instance, median B increases from 
0.591(M LNjStdMutjCO+G c) to 0.769 (M LNLowMutjCO+GG ), and from 
0.428 (M G;St dMut,co+Gc) to 0.654 (M G;LowMu ,,co + gc)- In addition, 
the comparison of predictions under models with and without 
gene conversion shows that the standard approach of considering 
crossover as the only source of recombination between sites would 
overestimate linkage effects. Median estimates of B are 20 and 
2 1 % lower for models Mln co and M G GO than for Mln jG o+gc 
and M G GO+GG , respectively. The use of only crossover rates in 
BGS models skews estimates of B particularly in regions with 
intermediate rates (~0.2— 2 cM/Mb), mostly across trimmed 
chromosomes. Both crossover and gene conversion data, there- 
fore, need to be considered to obtain accurate estimates of the 
consequences of selection on linked sites and, in this case, the 
magnitude of BGS effects. 

Finally, it is worth noting that although the different BGS 
models predict different point estimates and ranges of B across the 
chromosomes, all models generate B estimates across the genome 
that have virtually the same relative ranking (i.e., monotonically 
related; Table S3). Pairwise Spearman's p between estimates of B 
from different BGS models range between p = 0.9856 (comparing 
the two most distinct models MLN,stdMut,co and M G LowMut, 
co+gc) and p>0.9999 (for the four comparisons between models 
differing in the deleterious mutation rate). 

No BGS-free regions in the D. melanogaster 
genome. The upper end of the distribution of B across the 
genome is also informative and relevant for population genetic 
analyses of selection and demography that benefit from using 
regions evolving not only under neutrality but also free of linkage 
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Figure 1. Genome-wide estimates of BGS. (A) Boxplots of 
estimates of 6 for complete or trimmed chromosomes and for models 
M L N,stdMut,co+Gc (Mco+gc) and M LN , stdMut , C o (M co )- Results shown for the 
complete genome, and autosomes and the X chromosome separately. 
The median is identified by the line inside the box, the length of the 
box and whiskers indicate 50% and 90% CI, respectively. (B) Frequency 
distribution of 6 estimates for complete or trimmed chromosomes 
under model M LN , stc jMut,co+Go All results based on the analysis of 1-kb 
non-overlapping regions. 
doi:1 0.1 371 /journal.pgen.1 004434.g001 



effects. The upper 90% CI for B is 0.80 across the whole genome 
and 0.814 across the trimmed genome (Figures 1 and 2 and Table 
SI). Out of 119,027 1-kb regions investigated across the genome, 
the highest B is 0.897 and is located in a genomic region with very 
low density of genes and high crossover rate: position 5.027 Mb of 
the X chromosome, in the middle of a large 50-kb region with a 
single and short CDS and a crossover rate off ~14 cM/Mb. The 
1-kb autosomal region with the least BGS effects shows B= 0.822. 
BGS, therefore, plays a significant role across the entire genome, 
including long introns and intergenic sequences distant from genes 
in the middle of regions with high recombination rates. This 
conclusion is unlikely to be influenced by the parameters of the 



BGS model. As indicated, Mg models predict stronger BGS effects 
and lower B and, accordingly, we observe an upper 90% CI for B 
of 0.707 and a maximum estimate of 0.87. On the other hand, 
models with lower deleterious mutation rates generate higher B, 
but even these models predict detectable BGS across the whole 
genome. Ml^mui models generate upper 90% CIs for B ranging 
between 0.81 1 (M GjLowMu t,co) and 0.895(M LNjLowMutjCO+ Gc), and 
the highest 1-kb estimate of B obtained by any of our M Low Mut 
models is 0.947 (see Table SI and Figure SI). That is, these results 
strongly suggest that while there may be a fraction of sites free of 
selective constraints across the D. melanogaster genome, all sites 
might be, nonetheless, under the influence of selection at linked 
sites and BGS in particular. 

Autosomes show stronger BGS effects than the X 
chromosome. The X chromosome in D. melanogaster recom- 
bines at a higher rate than autosomes, caused by a higher median 
crossover rate per female meiosis (2.48 vs. 1.74 cM/Mb for X and 
autosomes, respectively), and the fact that the X chromosome 
spends less time than autosomes in males (that do not recombine 
during meiosis). This higher recombination, in turn, forecasts 
weaker BGS in the X [30,61]. In agreement, Charlesworth [33] 
predicted weaker BGS effects (higher B) in the middle of the X 
chromosome than in the middle of an autosome under a number 
of models. Our study shows the same trend (see Figure 1A and 
Table SI), with median B of 0.559 and 0.736 for autosomes and 
the X chromosome, respectively (0.619 and 0.761 for trimmed 
autosomes and the X chromosome, respectively). The direct 
comparison of all 1 -kb regions reveals that this higher B in the X 
chromosome relative to autosomes is highly significant (Mann- 
Whitney U Test, P<lxl0 -12 for complete and trimmed 
chromosomes), with all BGS models generating the same trend 
and level of significance. 

The ratio of neutral diversity for the X and autosomes (X/ A) 
predicted by our default BGS model is 0.99 and 0.92 for complete 
and trimmed chromosomes, respectively. Therefore, differences in 
BGS effects at the X and autosomes can, at least in part, explain 
the observation that the X/ A ratio of neutral diversity in several D. 
melanogaster populations is higher than the 0.75 predicted by most 
neutral models and a 1 : 1 sex ratio (see [33]). Note that X/ A ratios 
would be overestimated if gene conversion events were not taken 
into account, with X/A ratios of 1.12 and 0.99 for complete and 
trimmed chromosomes, respectively. 

The genomic units of BGS in D. melanogaster 

In this study, all sites along a chromosome were allowed to 
potentially play a role adding up BGS effects at any focal region of 
the same chromosome. To investigate the size of the genomic 
region causing detectable BGS effects in D. melanogaster, we 
estimated the size of the region surrounding a focal 1 -kb needed to 
generate 90% of the total BGS effect obtained when considering 
the complete chromosome (Db90 m either genetic or physical 
units). Equivalently, we also obtained D B75 and D B50 as the size of 
the genomic region needed to generate 75 and 50%, respectively, 
of the total BGS effect obtained when considering the whole 
chromosome. 

The study of complete chromosomes shows a median genetic 
TJbso, Db 7j and T> B50 of 5.5, 1.2 and 0.15 cM, respectively. In 
terms of physical distance, the median Dg 90 , T) B75 and D B50 is 
2024, 477 and 76 kb, respectively (Figure 4A). Although the 
overall effects of BGS are reduced along trimmed chromosomes 
compared to whole chromosomes, the size of the region playing a 
significant role in the final magnitude of B at a focal point is fairly 
equivalent, with 6.9, 1.8 and 0.21 cM for T) Bgo , T> B75 and D B50 , 
respectively (2,412, 640 and 84 kb for T> Bgo , T> B75 and T) BS0 , 
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Figure 2. High-resolution distribution of BGS effects across the D. melanogaster genome. Estimates of BGS effects are measured as B and 
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Figure 3. Relationship between local recombination rates and 
estimates of B across trimmed chromosomes. Local recombina- 
tion rates (c) measured as cM/Mb per female meiosis. Estimates of 8 
based on the default model M LN5tdMutco+GC . Results shown for 10-kb 
non-overlapping regions. 
doi:1 0.1 371 /journal.pgen.1 004434.g003 



respectively). This genetic and physical scale, moreover, increases 
with crossover rates (Figure 4B). This analysis, therefore, suggests 
that the extent of BGS at most genes and intergenic sites across the 
D. melanogaster genome is influenced by the cumulative effects of 
many sites and include numerous other genes. Thus, accurate 
estimates of B in D. melanogaster require the study of genomic 
regions at the cM or Mb scale, ideally full chromosomes. 
Otherwise, BGS can be severely underestimated, and inferences 
about demographic events or other types of selection may be 
unwarranted. These results are also in agreement with the 
previous observation that all intergenic sequences and introns 
across the genome are predicted to be influenced by BGS. 

Estimates of B are a very strong predictor of nucleotide 
diversity across the whole D. melanogaster genome 

A second goal of this study was to estimate how much of the 
observed levels of neutral diversity across the D. melanogaster 
genome can be explained by a BGS landscape obtained 
independently from variation data. A strong positive correlation 
would not only indicate that BGS should not be ignored in 
population genetic analyses but also that our estimates of B are 
likely suitable as baseline to infer additional types of selection and/ 
or demographic events. Because our best experimentally-obtained 
whole-genome recombination maps for crossover and gene 
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Figure 4. Genomic distance influencing patterns of BGS in D. 

melanogaster. (A) Boxplots of estimates of D B50 , D B75 and D B90 . D B90 is 
defined as the size of the genomic region around a focal point needed 
to generate 90% of the total BGS effect obtained when considering the 
whole chromosome. Equivalently, D B75 and D B50 indicate genomic 
distances needed to generate 75 and 50%, respectively, of the total BGS 
effect obtained when considering the complete chromosome. The units 
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(kb). (See Figure 1 legend for further explanation of boxplots.) Results 
shown for the default model M L N,stdMut,co+Go (B) Relationship between 
local recombination rates (c; cM/Mb per female meiosis) and estimates 
of D 675 in genetic distance. Results shown based on the analysis of 1 -kb 
non-overlapping regions across the whole genome (Spearman's 
p = 0.907, P<1 x10~^). 
doi:10.1371/journal.pgen.1004434.g004 



conversion have a maximum resolution and accuracy at the scale 
of 100-kb [31], the predictive nature of the B baseline was first 
investigated at this physical scale. 

To obtain levels of neutral diversity across the D. melanogaster 
genome, nucleotide diversity per bp (n s ^j at introns and intergenic 
sequences was estimated from a sub-Saharan African population 
(Rwanda RG population [62]; see Materials and Methods for 
details). The comparison of estimates of B generated by our BGS 
models and levels of 7t si i across the genome reveals a strikingly 
positive association (Table 1 and Figure 5). For autosomes, the 
correlation between B and Tt s ;i is p =0.770 (965 non-overlapping 



100-kb regions, P<lxl0 % and increases up to p = 0.836 (P< 
1 x 1 0 ~ 1 2 ) along individual autosome arms. Equivalent results are 
obtained when silent diversity is analyzed separately at intergenic 
and intronic sites, with p =0.736 between B and 7i; ntergen ; (: , and 
p =0.741 between B and n^t^ (P<1 xlO 12 in both cases). The 
study of individual autosome arms shows a positive association up 
to p = 0.799 and 0.800 for intergenic and intronic sites, 
respectively (P< 1x10 12 in both cases). 

The predictive nature of the B landscape in D. melanogaster 
remains remarkably high along trimmed autosomes where BGS 
has been often assumed to play a minor role explaining variation 
in levels of polymorphism. The correlation between B and 7t s j] is 
p =0.529, ranging up to p = 0.655 when trimmed chromosome 
arms are analyzed separately (P<1 xlO -12 in all cases). Addition- 
ally, the BGS models investigated generate a stronger association 
between B and n sii than between estimates of local crossover (c) 
and 7t s ji, particularly along trimmed chromosomes (p =0.677 and 
p = 0.397 for complete and trimmed autosomes, respectively). This 
last result exposes the limitations of using local c as an estimate of 
the overall strength of linked selection at a given genomic position, 
and highlights the importance of including long-range information 
of recombination rates and gene structures. Altogether, these 
results show the high predictive value of simple BGS models, with 
almost 60% of the observed variance in 7i s ;i across 100-kb 
autosomal regions explained by BGS, a percentage that is as high 
as ~70% when investigating variation in nucleotide diversity 
along individual chromosome arms (see Table 1). 

Robustness to different BGS parameters. The results 
presented above suggest that BGS could explain a very large 
fraction of the observed variation in diversity across the genome 
under the model that fits better our current selection and mutation 
data in D. melanogaster (model Mug s^M;,,), Importantly, BGS 
models using different DDFEs (M G instead of Mln) and/or a 
lower deleterious mutation rate (M LowMot instead of M StdMut ) 
generate equivalent results. Table 1 shows p between B and 7t sil for 
the different BGS models investigated: p between B and 7t si i ranges 
between 0.749 and 0.773 for complete autosomes, and between 
0.514 and 0.531 for trimmed autosomes (P< 1x10 12 in all cases). 
The study of intergenic and intronic sites separately generates 
similar outcomes, with p between B and Ttjutergenic ran g m g between 
0.736 and 0.739, and p between B and 7t intr „ n ranging between 
0.741 and 0.744 for the different models (P<1 xlO -12 in all cases). 
The similarity of outcomes should not be surprising based on the 
very high pairwise rank correlations between estimates of B 
generated by the different BGS models described above (Table 
S3). Taken together, these results emphasize the robustness of the 
approach to generate genome-wide baselines of B to study 
variation in nucleotide diversity along chromosomes. 

The X chromosome. Langley et al. (2012) [26] recendy 
showed that the correlation between crossover rates and levels of 
polymorphism is weaker along the X chromosome than in 
autosomes. In agreement, we also observe that the association 
between B and along the X chromosome is weaker than in 
autosomes, although it is still highly significant. Estimates of p 
between B and 7t sil range between 0.564 and 0.568 for the different 
BGS models (P<lxl0~ 12 in all cases), and between 0.366 
(P=4xl0~ 7 ) and 0.373 (P=2xl0 -7 ) for the trimmed X 
chromosome. Moreover, 7t s ^ shows a weaker correlation with 
crossover rates than with B also along the X: p between and c is 
0.526 (P<1 xlO" 12 ) and 0.322 (P= 9. 1 x 10" 6 ) for the complete 
and trimmed X chromosome, respectively. These results are 
consistent with BGS playing a weaker role along the X 
chromosome due to higher recombination rates (see above and 
[26,30,33,61]). As discussed below, however, additional causes 
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Table 1. Correlation coefficients between estimates of B 
models. 


and levels of polymorphism at noncoding sites (tc sM ) for different BGS 




Spearman's rank correlation // 




DDFE 


U 


Recombination 


Complete Chromosomes 


Trimmed Chromosomes 


BGS Model MLN.stdMut.co+Gc 


Log-normal 


1.2 


CO+GC 


0.770 (0.836) 


0.529 (0.655) 


MG,StdMut,CO+GC 


Gamma 


1.2 


CO+GC 


0.772 (0.838) 


0.531 (0.659) 


^LN,LowMut,CO+GC 


Log-normal 


0.6 


CO+GC 


0.770 (0.836) 


0.529 (0.655) 


MG,LowMut,CO+GC 


Gamma 


0.6 


CO+GC 


0.773 (0.838) 


0.531 (0.659) 


M[_N,StdMut,CO 


Log-normal 


1.2 


CO 


0.749 (0.808) 


0.497 (0.598) 


MG,StdMut,CO 


Gamma 


1.2 


CO 


0.761 (0.824) 


0.514 (0.630) 


M[_N,LowMut,CO 


Log-normal 


0.6 


CO 


0.749 (0.808) 


0.497 (0.596) 


^G,LowMut,CO 


Gamma 


0.6 


CO 


0.761 (0.824) 


0.514 (0.630) 


Local Crossover c 






CO 


0.677 (0.732) 


0.397 (0.486) 



Spearman's rank correlation coefficient [p) between estimates of B and ji si | (P<1 x10~ 12 for all cases) 
(P<1 x10~ 12 ). Results shown based on the analysis of non-overlapping 100-kb autosomal regions. 
a Values in parenthesis indicate the highest p between B and 7i si | along a single chromosome arm. 
doi:1 0.1 371 /journal.pgen.1 004434.t001 



Also shown, p between local crossover rates (c; cM/Mb) and 7i si | 



might be influencing X-linked extant variation, and include higher 
effectiveness of selection relative to autosomes. 

Candidate regions for recent selective sweeps or 
balancing selection using BGS predictions as baseline 

The robustness and high predictive power of the BGS models to 
explain qualitative trends of nucleotide diversity across the 
genome, suggest that we can investigate the presence of other 
forms of selection by searching for regions that depart from BGS 
expectations. We, therefore, compared observed 7t sil and levels of 
diversity predicted by B, and parameterized departures by using 
studentized residuals (ji s ;i-r; see Material and Methods). Overall, 
the distribution of 7t sil _ R does not show a significant departure from 
normality (/ z = 28.9, df. = 23, P= 0.183) thus validating the 
approach. Nevertheless, there are 58 outlier regions with nominal 



0.770 



0.678 




0.564 



0.295 




100kb 10kb 1kb 
Autosomes 



100kb 10kb 1kb 
X chromosome 



Figure 5. Correlation coefficients between estimates of B and 
levels of polymorphism at noncoding sites (tt sM ). Spearman's rank 
correlation coefficients (p) based on the analysis of 100-, 10- and 1-kb 
non-overlapping regions are shown above columns (P<1 x10 -12 in all 
cases) and the number of regions analyzed is shown within columns. 
Results shown for the default model M LN-S tdMut,co+Go 
doi:1 0.1 371 /journal.pgen.1 004434.g005 



P<0.05, 24 regions with significantly negative ti s h-r (revealing a 
deficit in n s a relative to BGS expectations) and 34 regions with 
significantly high 7I s ;i_r (revealing a relative excess of 7t s ;i). 

Regions with a relative deficit of 7i s u are candidate regions for 
recent adaptive events [2,8,63] and our data confirms the presence 
of several regions with the fingerprints of a recent selective sweep 
across the D. melanogaster genome [22,23,25,26,29]. The strongest 
signal of selection at this 100-kb scale, and the only region that 
shows a departure that remains significant after correction for 
multiple tests [P< 1.6x10 s , with false discovery rate (FDR) q- 
value<0.10], suggests a recent selective sweep at position 8.0- 
8. 1 Mb along chromosome arm 2R. This genomic region includes 
gene Cyp6gl and also showed the strongest signal of directional 
selection and selective sweep in large-scale population genetic 
analyses of North American [26] and Australian D. melanogaster 
[64] populations. Not all regions with a severe reduction in 7i sil 
across the trimmed genome, however, may need recent adaptive 
explanations. A number of regions with 7i s y much smaller than the 
median (e.g., 0.002 relative to a median of 0.008; see Figure 6A) 
also show estimates of B of 0.25 or smaller and, thus, the observed 
7t si i would be close to the predicted level of neutral diversity when a 
BGS context is taken into account. 

Uncovering the signature of balancing selection. Numerous 
regions show an excess of 7t s ji relative to the BGS baseline and are, 
therefore, candidates for containing sequences experiencing balancing 
selection. Importantly, many of the regions showing significantly 
higher 7t sil _ R would not stand out unless when compared to region- 
specific BGS predictions. At this scale, for instance, the second 
strongest departure from BGS expectations is located at position 18.4— 
18.5 Mb along chromosome arm 3L (P= 1 xlO with a 7i s a of 
0.0093 that would not be noticeably high using the standard approach 
of comparing it to genome-wide expectations (Figure 6A). Another 
candidate region under balancing selection is located in cytological 
location 38A, showing a level of diversity also close to the average 
along autosomes but almost three-fold higher than the level predicted 
by B (P= 0.002). This candidate genomic region with a relative excess 
of 7t s ;i is within a QTL detected for variation in life span and proposed 
to be maintained by balancing selection in D. melanogaster [65-67]. 
Certainly, the detection of many regions with a relative excess of 7t s ^ 
based on BGS expectations opens the possibility that molecular 
signatures of balancing selection may have been masked by BGS 
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Figure 6. Relationship between estimates of B and jt si |. (A) Relationship between estimates of 6 and 7i si | for 100-kb autosomal regions. 
Spearman's p = QllQ (1,189 regions, P<1 x10 -12 ). (B) Relationship between estimates of B and Ti si | for 10-kb autosomal regions. Spearman's p = 0.678 
(8,883 regions, P<1 x1CT 12 ). Dark blue and red diamonds indicate regions with 7i si | significantly different (higher or lower) than predicted based on 
residual analysis: dark blue (P<0.01), red (FDR-corrected cj<0.10). Results shown for the default model M LNiSt diviut,co+Gc- 
doi:1 0.1 371 /journal.pgen.1 004434.g006 



effects when compared to purely neutral expectations without linkage 
effects, and thus be more prevalent than currently appreciated. 

Analyses of BGS effects and outliers of diversity at 10-kb 
and 1-kb scales 

Because the genome-wide recombination maps used in this 
study were generated to have good accuracy at the scale of 100-kb, 
our BGS models assumed homogeneous rates within each 100-kb 
region. Notably, these models predict variation in B across 100-kb 
regions due to the heterogeneous location of genes and exons 
within these regions and the differential effects of proximal and 
distal flanking regions. Nevertheless, detailed analyses of a few 
small genomic regions have revealed recombination rate variation 
at a smaller scale in Drosophila [32,68]. Therefore, outliers from 
BGS expectations at scales smaller than 100-kb can reveal the 
localized fingerprints of other types of selection (directional or 
balancing selection) but the possibility of uncharacterized hetero- 
geneity in recombination within these regions cannot be formally 
ruled out. That said, the study of the relevant size of the genomic 
region adding up BGS effects at a given focal point (with T>b75 > 



200 kb; see above) suggests that very local recombination variation 
may play a limited role influencing B at a focal point. With these 
caveats and considerations in mind, we investigated the presence 
of outliers at the scale of 10 and 1-kb to identify candidate regions 
under positive or balancing selection using the approach discussed 
for 100-kb regions. 

The strong relationship between B and the observed level of 
silent diversity is maintained when analyzing smaller regions 
(Figure 5). At 10-kb scale, B remains a very good predictor of 7l sil 
along complete autosomes (p = 0.678, 8,883 regions; Figure 6B) 
whereas p is 0.551 (55,467 regions) at the finest scale of 1-kb (P< 
lxlO" 12 in both cases). The use of BGS models with different 
parameters (M GjStdMu t, M GLowMut or Mt^lowMuO generate 
virtually equivalent results, with p between estimates of B and 
7i s ii ranging between 0.678 and 0.682 for analyses at the 10-kb 
scale, and with p ranging between and 0.551 and 0.554 for 
analyses at the 1-kb scale. As observed before, B along the X 
chromosome shows reduced association with 7i sil than for 
autosomes also at 10- and 1-kb resolution. For X-linked regions, 
the correlation between B and 7t s fl is p = 0.397 (1,979 regions) and 



Table 2. Correlation coefficients between estimates of B and rates of protein evolution (co R ). 









Spearman's // 


Probability 


Individual genes 


All 


-0.057 


2x10~ 12 




Autosomes 


-0.071 


5.7x10~ 8 




X chromosome 


-0.189 


3.4 x10~ 8 


100-kb regions 


All 


-0.187 


6.6 x10~ 9 




Autosomes 


-0.160 


6.1 x10~ 6 




X chromosome 


-0.367 


1.5x10~ 6 



Rates of protein evolution (to R ) were obtained after controlling for selection on synonymous mutations based on residual analysis (see Materials and Methods for 
details). Spearman's rank correlation coefficients (p) are shown between B and co R for individual genes and for the average S and oj r for all genes within 100-kb non- 
overlapping regions. 
doi:1 0.1 371 /journal.pgen.1 004434.t002 
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0.295 (12,680 regions) for 10- and 1-kb regions, respectively (P< 
1 x 10" 12 in both cases). 

Outliers at 10-kb scale. Among the 10,812 regions under 
analysis, 208 depart from expectations with nominal P<0.01, 20 
showing a relative deficit of 7t sil and 188 with a relative excess of 
7t s ;i. Nine of these regions remain significant after correction for 
multiple tests (FDR Rvalue <0. 10), and all of them show an excess 
of 7t s j] (Figure 6B). Among the candidate regions for a recent 
selective sweep (relative deficit of 7t sil ), we detect 3 genomic regions 
with clusters of several 10-kb regions with nominal P<0.01. In 
agreement with the analyses at 100-kb scale, 10 consecutive 10-kb 
regions show significant deficit of variation, near gene Cyp6gl (see 
above and [26,64]). A second cluster of six 10-kb regions with 
reduced variation within a 120-kb interval is detected in 
chromosome arm 2R, with a peak signal centered at gene Dll 
(Distal-less), a transcription factor that plays a role in larval and 
adult appendage development. A third region with two adjacent 
1 0-kb regions showing a relative deficit of 7t sil is located in the X 
chromosome, centered at gene CG32783 (a protein-encoding gene 
with no known orthologs outside the melanogaster subgroup). 

On the other hand, we detect many 10-kb regions that show a 
relative excess of diversity and are candidate regions for balancing 
selection. The strongest signal associated with excess of 7t sil based 
on 100-kb regions now shows significantly higher variation at four 
adjacent 10-kb regions (19.360-19.400 Mb along chromosome 
arm 2L). Other strong candidate regions under balancing selection 
detected at this fine scale (all with FDR value < 0.10) include the 
genes PH4clME2 (oxidation-reduction process), dpr20 (sensory 
perception of chemical stimulus), Dhcl 6F (regulation of transcrip- 
tion and microtubule-based movement) and Gai. Note that gene 
PH4olNE2 was also shown to have an excess of protein 
polymorphism in the sister species D. simulans [24] , thus potentially 
revealing a rare case of trans-specific balancing selection. 

To further investigate whether the outlier regions are indeed 
associated with recent selective sweeps and balancing selection, 
we estimated Tajima's D (D T ; [69]) to quantify potential 
differences in the frequency of polymorphic variants within the 
population. Selective sweeps and balancing selection are predict- 
ed to generate an excess of low-frequency (negative D?) and 
intermediate-frequency (positive Dt) variants, respectively. In 
agreement, outlier regions with negative and positive 7t sil . R have 
more negative (Mann- Whitney Z7 Test, P= 4.3 x 10 10 ) and more 
positive (P<lxl0~ 12 ) P>r, respectively, than regions not depart- 
ing from BGS expectations. Moreover, we observe a positive 
association between 7t s ji_R and i>r across the genome (p = 0.280, 
P<lxl0 12 ) that is stronger than between 7t s ;i and P>i 
(p = 0.087). Notably, this association between 7I s ;i_r and P>r 
increases across trimmed chromosomes (p = 0.302 and p = 0.637 
for trimmed autosomes and the X chromosome, respectively; P< 
lxlO -12 ). Taken together, these results reinforce the concept 
that estimates of 7t s ii-R obtained when using BGS predictions as 
baseline are a good predictor of recent selective sweeps and 
balancing selection, capturing departures in number of polymor- 
phic sites as well as the expected consequences on variant 
frequency. 

Outliers at 1-kb scale. The study of 1-kb regions reveals 
1,213 of them having diversity levels that depart from BGS 
expectations with nominal P<0.01, 1,186 and 27 of these 
regions show a relative excess and deficit of 7t s u, respectively. 
Fifty-two regions (0.076%) show an FDR corrected q<0A0, all 
with higher 7i sil than predicted by the BGS model. Twenty-five 
out of the 27 1-kb regions showing a reduction in xc sil with P< 
0.01 cluster together at position 8.017-8.103 Mb along chro- 
mosome arm 2R, in agreement with the analyses at 100- and 



10-kb scales that detected outlier regions near gene Cyp6gl. This 
more detailed analysis suggests that the target of selection is 
likely located at or proximal to Cyp6gl. The other two outlier 
regions with deficit in 7C s a are located at genes cic (regulation of 
transcription) and CGI 1266 (mRNA binding and splicing). 

In terms of regions with an excess of 7t sa , the strongest signal of 
balancing selection based on 100- and 10-kb analyses (19.3-19.4 along 
chromosome arm 2L) is now localized with more precision close to 
genes CG17349 and CG17350. Other regions detected at 10-kb scale 
are also identified at this finest scale and include genes PH4&NE2, 
dpr20, and Dhcl6F(aH with FDR 9 -value<0.10). Additional candidate 
regions under balancing selection include genes CecAll CecA2/CecB 
(antibacterial humoral response), Sema-bc (olfactory behavior), CG5946 
(inter-male aggressive behavior), IM4 (defense response), Cjp6al6 (a 
P450 gene), and three genes encoding cuticular proteins (CprllA, 
Cpr62Bb and Cpr65Ec), among others. Finally, the study of 1-kb regions 
also shows a strong positive relationship between 7t sil . R and P>r 
(p = 0.477, P< 1x10 12 ), in agreement with the expected consequenc- 
es of selective sweeps and balancing selection on the frequency of 
segregating variants (see above). 

Negative relationship between estimates of B and the 
rate of protein evolution 

Another prediction of the models of selection and linkage (either 
HHss or BGS) is that, parallel to a reduction in intra-specific 
variation, there will be a reduction in efficacy of selection (i.e., the 
Hill-Robertson effect [10,70-76]). This general prediction has 
been previously confirmed in Drosophila using local low-resolution 
crossover rates as indirect measure for the magnitude of Hill- 
Robertson effects acting on a gene. These studies showed weak but 
highly significant associations between crossover rates and 
estimates of codon usage bias or rates of protein evolution 
[71,74,75,77-85]. 

To investigate whether B landscapes also capture differences in 
efficacy of selection, we focused on selection against amino acid 
substitutions along the D. melanogaster lineage, after split from the D. 
simulans lineage (less than 5 mya [86]). To this end, we obtained the 
ratio of nonsynonymous to synonymous changes (co, co = d M / d s ) for 
6,677 protein encoding genes and, more informatively, the variation 
in to after controlling for selection on synonymous mutations based 
on residual analysis (co R ; see Materials and Methods for details). 
When each gene is analyzed as a single data point, there is a 
negative association between B and cor (p = —0.086, P= 2x10 12 ; 
Table 2). Interestingly, and contrary to the results of nucleotide 
diversity, the X chromosome shows a tendency for a stronger effect 
of B on co R than autosomes: p = -0.189 (P= 3.4 xl0~ 8 ) and p= — 
0.071 (P= 5.7x10 8 ) for X-linked and autosomal genes, respec- 
tively. An equivalent but more clear pattern is observed at the scale 
of the resolution of our recombination maps (100 kb), where 
estimating the average co and co R for all genes within each region 
also allows for reducing idiosyncrasies of different genes influencing 
rates of protein evolution (e.g., gene expression breadth and levels, 
protein length, etc.; see [84]). At this scale, variation in B is 
negatively associated with co R along autosomes (p=— 0.160, 
P=6.1xl0~ 6 ) and the X chromosome (p = -0.367, 
P= 1.5x10 ; Table 2). A gain, the association between estimates 
of B and rates of protein evolution is robust to different BGS models 
and parameters. Equivalent p are obtained for all eight BGS models 
investigated, and this is observed when analyzing individual genes (p 
between B and co R ranging between —0.0856 and —0.0874; PS 
3x10 12 ) and when using the average cor for genes within 100-kb 
regions (p ranging between -0.187 and -0.193; P<5.9xl0~ 9 ) 
across the whole genome. 
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Temporal variation in recombination landscapes and its 
consequences 

The association between B and rates of amino acid substitution 
along the D. melanogaster lineage, although highly significant in 
terms of associated probability, is much weaker than that for levels 
of polymorphism at silent sites. Heterogeneity in overall evolu- 
tionary constraints among proteins is expected to add substantial 
variance when investigating the consequences of B on oj relative to 
studies of B on K s]1 because 7i sil is only influenced by local N e and 
the mutation rate. Nevertheless, temporally variable recombina- 
tion rates at a given genomic location would also reduce the 
association between B and (U. Indeed, the high degree of intra- 
specific variation in crossover genetic maps within current D. 
melanogaster populations [31] together with differences in genetic 
maps between closely related Drosopkila species [87-89] support 
the notion that recombination landscapes vary within short 
evolutionary scales, at least across trimmed chromosomes. Under 
this scenario, extant recombination rates and the corresponding 
estimates of linkage effects would be poor predictors of interspe- 
cific rates of protein evolution, even between closely related 
species. 

In this study across the D. melanogaster genome, the use of 
recombination rates obtained experimentally would provide only 
an approximation for the relevant B along the lineage leading to D. 
melanogaster populations [84]. These estimates of B would be an 
even weaker predictor of cor (or ffl) along the D. simulans lineage 
after split from the D. melanogaster lineage. In agreement, we 
observe no significant relationship between B and co R estimated 
along the D. simulans lineage (p = —0.009 based on the default 
BGS model whereas the other BGS models generate p ranging 
between —0.014 and +0.01 1; P>0.25 in all cases). A similar result 
has been obtained in comparisons of local crossover rates and rates 
of protein evolution between two other closely related Drosopkila 
species, D. pseudoobscura and D. persimilis [88] . 

In species where BGS plays a significant role, temporal 
fluctuations in recombination landscapes could influence a 
number of analyses of selection that assume constancy in JV e , 
including estimates of the fraction of adaptive substitutions (a. 
[6,90-92]). Several studies have shown that the bias in estimating 
a can rapidly reach considerable values as a consequence of 
demographic changes, with a being overestimated when N e 
influencing polymorphism is larger than M e influencing divergence 
{Ne_Poi > Ne_Dim [34,91]). We propose that temporal changes in 
recombination at a given genomic position would generate an 
equivalent scenario, with a B influencing polymorphism [B pjj that 
differs from long-term B influencing divergence {B_b^), or the 
corresponding terms for local N P Because long term jV e can be 
approximated by its harmonic mean [93], temporal fluctuations of 
recombination landscapes (and of local B and, therefore, local JVQ 
would also predict a tendency for local Ne P^Ne Div Such 
scenario would allow amino acid changes to make a larger relative 
contribution to divergence than to polymorphism, particularly in 
regions where recombination has recently increased, and thus bias 
estimates of a. upward. 

Precise quantitative predictions of the potential bias in a would 
minimally depend on the rate, magnitude and physical scale of the 
variation in recombination landscapes along lineages. To obtain 
initial insight into the effects of temporal changes in recombination 
rates on estimates of cc, we investigated a rather simple and 
conservative scenario with forward population genetic simulations. 
In particular, we used the program SLIM [94] to capture the 
consequences of temporal changes in linkage effects on estimates 
of a when only neutral and deleterious mutations occur along an 
archetypal 1-Mb region for D. melanogaster that includes 100 
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Figure 7. Estimates of a due to temporally fluctuating 
recombination rates and variable BGS effects. Results based on 
forward population genetic simulations of 10,000 diploid individuals 
(N), a chromosome segment of 1 Mb containing 100 genes, and two 
types of mutations: neutral and deleterious (see Materials and Methods 
for details). Cycles of fluctuating recombination followed phases of 
moderately high recombination for IN generations (H rec phase) and 
moderately low recombination for 3W generations (L rec phase). 
Estimates of a at negatively selected sites obtained following the 
models proposed by Eyre-Walker and Keightley [91,95] every 0.1 N 
generations. Blue and red lines indicate estimates of a assuming 
constant population size and variable population sizes, respectively. 
Continuous and dashed lines indicate estimates of ct with and without 
correction for the effect of polymorphism to divergence, respectively. 
doi:10.1371/joumal.pgen.1004434.g007 

protein coding genes (see Materials and Methods for details). 
Figure 7 shows the results of estimating a at selected sites under 
fluctuating recombination rates, with cycles of moderately high 
recombination for IJV generations (with JV indicating the diploid 
population size) followed by moderately low (not zero) recombi- 
nation for 3jV generations. Estimates of cc based on models that 
assume constant population size [34,9 1 ,95] overestimate the true a. 
particularly when extant recombination is high (a>0.3), with an 
overall a~0.15 when the data is combined from all temporal 
points. As expected, models allowing for population size change 
[91] generate more unbiased estimates of a that show, nonetheless, 
a tendency upward, likely due to limitations assessing older 
population size changes. 

Discussion 

Discerning the relative contribution of different types of 
selection to patterns of intra- and interspecific variation is not 
trivial, in part because essential population and demographic 
parameters are often not known and the potential interactions 
among them are still poorly characterized. Here, we obtained the 
baseline of diversity across the genome predicted by BGS models 
completely independent of the data on nucleotide variation. The 
results of this study suggest that there might be no euchromatic 
region completely free of linkage effects to deleterious mutations in 
D. melanogaster. Instead, there are only genomic sites (neutral or 
under selection) associated with diverse degrees of BGS effects and 
thus under highly variable local N e even across the recombining 
regions of the genome. In this regard, the heterogeneity in local M e 
may bias population genetic estimates of selection or recombina- 
tion if such variation is not taken into account. The pervasive 
presence of BGS has also potential consequences on demographic 
studies because BGS is known to generate a moderate excess of 
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low-frequency variants [96-102]. As a result, demographic 
inferences based on allele-frequencies may be skewed, with 
consistent patterns suggestive of a recent population expansion. 

The next question investigated was how much of the patterns of 
variation in D. melanogaster could be explained by purifying 
selection alone. The results are consistent with BGS playing a 
major role in explaining the observed heterogeneity in nucleotide 
diversity across the entire D. melanogaster genome. At 100-kb scale, 
BGS can explain -58% (p = 0.749-0.773 for different BGS 
models) of the variation of n s n across the genome. The study of 
smaller regions reduces the statistical association between B and 
7t si i, but even when analyzing 10- and 1-kb regions, B explains 
-46% (p = 0.678-0.682), and -30% (p = 0.551-0.554) of all the 
observed variance in 7t sil , respectively. These percentages increase 
up to -70% (p = 0.836) for 100-kb regions, -53% (p = 0.728) for 
10-kb regions, and —36% (p = 0.599) for 1-kb regions, across 
individual chromosome arms. 

Importandy, these conclusions are robust to differences in 
parameters of the BGS model (e.g., deleterious mutation rates, 
DDFE, or recombination) even though the precise magnitude of 
BGS effects does depend on these parameters. Median estimates of 
B range between 0.337 and 0.769 for different BGS models, but 
the distribution of B along the chromosomes maintains the same 
rank order (pairwise p between estimates of B generated by the 
BGS models is &0.9856) and is similarly associated with the 
observed distribution of nucleotide diversity. 

Taken together, the results and conclusions of this study imply 
that null expectations based on models that ignore linkage effects 
to deleterious mutations (or assume homogeneous distribution of 
N e across the genome) are likely inaccurate in D. melanogaster, even 
across trimmed chromosomes. The fact that deleterious mutations 
are more frequent than mutations involved in balancing selection 
or adaptive events, together with the robustness of our conclusions 
to reasonable ranges of selection and mutation parameters, suggest 
that BGS predictions may be adequate as a baseline of diversity 
levels and can be used detect outlier regions subject to other 
selective regimes. 

The use of B as baseline reveals several regions with signal of 
recent directional selection and associated selective sweeps, where 
levels of variation are lower than those predicted by BGS alone 
(e.g., near gene Cyp6gl [26,64]). In agreement with expectations 
after a selective sweep, these regions also show an excess of 
variants at low frequency. A number of other regions with reduced 
levels of variation, however, are located in recombining genomic 
regions where BGS is predicted to have strong effects. These 
results, therefore, are consistent with a detectable but limited 
incidence of recent classic 'hard sweeps' (where diversity is fully 
removed near the selected sites) caused by beneficial mutations 
with large effects. Still it must be acknowledged that older selective 
sweeps, sweeps caused by weakly beneficial mutations, or 'soft 
sweeps' associated with standing genetic variation or polygenic 
adaptation [103,104] would not be detected in this study. 

In fact, estimates of the proportion of adaptive substitutions 
indicate that beneficial mutations are not rare in Drosophila [6,90- 
92]. Moreover, analyses of nucleotide diversity around amino acid 
substitutions suggest that a majority of these beneficial mutations 
involve small effects on fitness [25,105,106] and cause detectable 
but very localized reduction in adjacent diversity (at the scale of 
25 bp; [105,106]). Therefore, a bulk of beneficial mutations in 
Drosophila may be difficult to detect in genome-scans of variation 
but could contribute significandy to differences between species 
and overall adaptive rates of evolution. 

BGS baselines, however, are particularly adequate to detect 
genomic regions under balancing selection because the predicted 



genomic signature of an excess of diversity may not be always 
evident when compared to genome-wide averages or purely 
neutral expectations (see Figure 6). Indeed, the use of a B baseline 
across the whole D. melanogaster genome provides the adequate 
local N e context caused by purifying selection and corresponding 
linkage effects. In all, our study uncovers numerous candidate 
regions for balancing selection, identifying genes involved in 
sensory perception of chemical stimulus, antibacterial humoral 
response, olfactory behavior, inter-male aggressive behavior, 
defense response, etc. These genomic regions not only show a 
relative excess of polymorphic sites but also have segregating 
variants at higher frequency than the rest of the genome (another 
telltale sign of balancing selection). The results based on the study 
of a single population are appealing since heterogeneous 
environments or temporal changes predict the maintenance of 
local polymorphism through balancing selection [107-109], and 
are consistent with analyses of clines in D. melanogaster that detected 
the signature of local adaptation and spatially varying selection 
between populations [64, 1 1 0, 1 1 1] . 

Additionally, the results evidence a clear difference between 
autosomes and the X chromosome in terms of the consequences of 
variable B on levels of variation. While the X chromosome exhibits 
a reduced association between B and neutral diversity than 
autosomes, it shows a better fit between B and rates of protein 
evolution and efficacy of selection. These patterns are predicted by 
higher rates of recombination and a higher fraction of deleterious 
mutations with minimal role generating BGS effects in the X (see 
Material and Methods). Another factor possibly influencing this 
difference between X and autosomes is stronger efficacy of 
selection in the X chromosome [112]. Indeed, events of adaptive 
and/or stabilizing selection would distort levels of neutral diversity 
at linked sites beyond BGS predictions and stronger selection in 
the X would explain a reduced association between B and levels of 
diversity along the X relative to autosomes. Such combined 
scenario of reduced BGS effects and stronger selection would also 
explain a number of patterns observed in Drosophila, including an 
increased degree of synonymous codon usage bias [112-115], 
stronger purifying and positive selection acting at the level of 
protein evolution in X-linked genes [115,116], and the 'faster-X' 
effect [117,118] reported at both protein and gene expression 
levels [119-123]. Of interest will be the study of species with 
higher average recombination rate in autosomes than in the X, 
thus partially uncoupling differences in linkage effects and X- 
specific patterns. 

Moreover, the results of this and previous studies [26,87-89] 
suggest that recombination rates change frequendy in the 
Drosophila genus, and often involve differences in the distribution 
of recombination rates across the genome (i.e., a change of the 
recombination landscape). Very recent changes in recombination 
landscapes would uncouple extant recombination rates and 
polymorphism patterns generated during the last few — N e 
generations and, therefore, the reported contribution of BGS to 
patterns of diversity across the genome may be underestimated. 
On the other hand, changes in the recombination environment in 
species where BGS plays a significant role would also predict 
temporal variation in local B and impact a number of studies of 
selection that assume constancy of N e along lineages or that 
changes in N e should be equivalent across the genome. 

Temporally variable recombination landscapes can generate, 
for instance, spurious evidence for multimodality in the distribu- 
tion of fitness effects, or lineage-specific physical clustering of 
amino acid changes (such clustering has been observed among 
Drosophila species [124,125]). Another consequence of temporally 
variable recombination landscapes (and local B and J\Q would be 
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gene- or region-specific inequality of short- and long-term N e . 
Regions that have recently increased in recombination would 
show patterns of variation suggesting population expansion or bias 
estimates of a upward, making it less negative or even positive with 
no adaptive mutations [126]. Moreover, these changes in 
recombination landscape would also forecast substantial be- 
tween-gene variation in a without requiring adaptive evolution. 
At this point, therefore, there is the open possibility that positive 
estimates of a in Drosophila and other species with large population 
size (see [6,22] and references therein) may be influenced, to an 
unknown degree, by temporal changes in recombination rates and 
landscapes. Future analyses designed to estimate population size 
changes or the strength and frequency of adaptive events would, 
therefore, benefit from including variable BGS effects across 
genomes and along lineages, ideally discerning local variation in 
BGS (and local J\Q from genome-wide patterns that may represent 
true demographic events. Genome-wide analyses may also need to 
consider the non-negligible presence of regions under balancing 
selection to prevent overestimating the extent of recent sweeps 
based on a relative reduction in levels of diversity. 

Finally, a number of limitations and areas for future improve- 
ment need to be mentioned. In terms of selection, we investigated 
BGS models based on either a log-normal or a gamma DDFE 
[39-43,46,127]. Trying to include all possible mutations with 
deleterious effects across a genome into a single distribution is, 
however, a clear oversimplification. A combination of different 
DDFEs for different groups of genes and/ or sites would be a more 
fitting approach, ideally based on a priori information independent 
of levels of variation (e.g., based on amino acid composition, 
expression levels and patterns, connectivity, etc. [84]). 

To gain initial insight into the potential consequences of more 
realistic models, we obtained estimates of B under BGS models 
that allow for two DDFEs (one for nonsynonymous changes and 
one for changes at constrained noncoding sites; see Materials and 
Methods for details). We then investigated whether such models 
would alter our conclusions, mostly in terms of the proposed 
adequacy of using BGS predictions as baseline to detect outiiers. 
The results show that models with two DDFEs depart only 
marginally from the models with a single DDFE: rank correlations 
between estimates of B predicted by these hybrid models and all 
eight previous models show p ranging between 0.946 and 0.998. 
The association between predicted B and the variation of 7i s ii 
across the genome is also very high (albeit slightly lower than for 
models based on a single DDFE), with p&0.711at 100-kb scale 
and p^0.504 at 1-kb scale. Notably, the comparison of outliers 
generated by these 2-DDFE models with those obtained by the 
models described above reveals few differences. For instance, 50 
out of the 52 significant (FDR ^-value<0.10) 1-kb outlier regions 
based on the default model are also predicted to be equally 
significant outliers by a model assuming a log-normal DDFE for 
nonsynonymous mutations and a gamma for deleterious noncod- 
ing mutations (the other two regions show departure with P< 
0.0001). Overall, 87.1% of the 1,213 1-kb regions showing 
departure at P<0.01 using the default BGS model are also 
detected as outliers (P<0.01) under this hybrid model. These 
results further support the robustness of the proposed approach to 
detect outlier regions based on BGS baselines. 

Another line of future improvement is the physical resolution of 
our recombination maps and, perhaps more important, how well 
these maps represent the recent history of a population or species. 
To generate B landscapes, we used recombination maps that were 
experimentally obtained (independent of nucleotide variation data) 
and capture some intra-specific variation in recombination 
landscapes (see [3 1] for details). The fact that these B landscapes 



explain a very large fraction of the variance in nucleotide diversity 
across the genome suggests that these recombination maps 
represent quite accurately the recombination landscape in the 
recent history of D. melanogaster. Future recombination maps 
obtained from additional natural strains and populations, or under 
different biotic/ abiotic conditions, can only improve the confi- 
dence in outlier regions and, ultimately, our understanding of 
selective and demographic events in this species. 

Materials and Methods 

General BGS expectations 

The consequences of BGS at a given nucleotide position in the 
genome (focal point) can be described by B [12-15], the predicted 
level of neutral nucleotide diversity under circumstances where 
selection and linkage are allowed (n) relative to the level of 
diversity under complete neutrality and free recombination (7t 0 ). 
Following [15,16], we have 

n ^ UjSi 
B= — = exp — > =■ 

where a, is the deleterious mutation rate at the i-th selected site out 
of the n possibly linked sites, ,s, indicates the selection coefficient 
against a homozygous mutation and r, is the recombination 
frequency between the focal neutral site and the selected i-th site. 
Note that under a BGS scenario B can only be equal (no BGS 
effects) or lower than 1, and _B<<1 indicates strong BGS effects. 

Molecular evolutionary analyses indicate variable fitness effects 
of deleterious mutations [44,121,127,128] and, therefore, a 
probability distribution of deleterious fitness effects (DDFE) of 
mutations at site i, (j>(s^j needs to be included, generating 

00 

B = exp- ^ u t — — —2 dst 

,=o I (si + (lsi)rj) 

B, therefore, is predicted to vary across the genome as 
consequences of the known difference in recombination rates 
along chromosomes as well as due to the heterogeneous 
distribution of sites under selection across genomes (genes, exons, 
etc.). However, and because we are allowing the selection 
coefficient s to vary according to a distribution <j>(sj, some sites 
under selection may have selection coefficients too small to play 
any BGS effect, thus not all sites under section need to be included 
in the study [33]. We truncate the distribution of selection 
coefficients at St (jt~1AAQ because mutations with s<s-t are 
effectively neutral and do not contribute to BGS. Different 
DDFEs, therefore, will generate a different fraction of deleterious 
mutations with j>,s T and, therefore, different BGS effects (see 
below). 

High-resolution estimates of 8 across the whole D. 
melanogaster genome 

We investigated a model that assumes the possibility of strongly- 
deleterious mutations at nonsynonymous sites as well as at a 
fraction of noncoding sites, either untranslated genie (introns and 
5'- and 3' -flanking UTRs) or intergenic regions. B, therefore, can 
be estimated at any focal neutral site of the genome as the 
cumulative effects of deleterious mutations at any other non- 
synonymous, untranslated genie and intergenic site along the same 
chromosome. For simplicity and based on analyses of rates of 
evolution between Drosophila species [22,37,38,44,91], we assumed 
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that the distribution and magnitude of selection on intergenic or 
UTR noncoding strongly selected sites [(j> nc ^ s (s)] is equivalent to 
that on nonysynonymous sites [<t> m {s)]. 

We can estimate B at any focal neutral site of the genome using 
the equation described above, (j> aa (s), the actual genome annota- 
tion (D. melanogaster annotation release 5.47, http://flybase.org/) 
and high-definition recombination maps for crossing over and 
gene conversion events [31]. The analysis of every nucleotide sites 
along a single chromosomal arm taking into account all other sites 
under selection of this same chromosome would, however, require 
> 1 x 1 0 pair-wise nucleotide comparisons, each one requiring a 
numerical integration. To speed up the process we followed [17] 
and first obtained the integral 



J (S aa + (1 -Saa)r) 



rds 



along a continuum for possible recombination rates between two 
sites, in our case for r ranging between 1x10 10 (equivalent to 
c-O.Ql) to 1; we assumed r=0 whenever the recombination 
distance between two nucleotides is smaller than lxlO -10 . 
Because he recombination distance between any pair of nucleo- 
tides can be obtained, the generation of complete maps of B is now 
more tractable although still computationally very intensive. We 
then made the simplifying assumption of ignoring variation within 
1-kb windows when estimating B at any another 1-kb window 
along the chromosome. That is, for the purpose of generating BGS 
effects, all sites within a 1-kb window have an equivalent 
recombination rate with sites within any other 1 -kb window along 
the chromosome. This is a reasonable approximation at this time 
due to the fact that the resolution of our best genome-wide 
recombination maps is 100-kb and differences in recombination 
due to a few nucleotides play a minor role when comparing sites 
separated by tens of hundreds of kbs. 

By dividing a chromosome arm into L adjacent 1-kb windows, B 
at a focal neutral site located at the center of the j-th window (Bj), 
can be estimated by using: 



Xj = ^ Naa i u * 



Yj = Nutrssi u u 



Zj = NncsSi 



<j>{s aa )s ai 



<KSqq) So, 



<\>{s aa ) So, 



Bj=exp -(Xj+Yj + Zj) 

where jVaa,, Nutr_ss{ and Nnc_ssi are the number of nonsynon- 
ymous, UTR and intergenic sites possibly under strong selection 
within window i, respectively, and is the recombination between 
the center of the focal window j and the center of window i, with ij, 
increasing in 1-kb intervals. u aa , u ulr _ ss , and u nc _ ss axt the 
deleterious mutation rate at nonsynonymous, UTR and intergenic 
sites , respectively. Because all sites in the genome are actually 



being taken into account, this approach avoids the need to 
interpolate B estimates and generates a very detailed B landscape. 

Deleterious mutation rates. Initial estimates of the muta- 
tion rate based on mutation accumulation lines in D. melanogaster 
suggested a mutation rate (a) of ~8.4xl0~ /bp/generation, for 
point mutations and small indels [47]. More recent analyses 
suggest u ~4-5 x 10 9 /bp/ generation after removing data from a 
line with unusually high mutation rates [48-50]. Based on the 
fraction of conserved sites in exons and noncoding sequences 
[22,37], this lower mutation rate predicts a diploid rate of 
deleterious mutations per generation (U) of 0.6 for euchromatic 
regions. Z7~0.6, however, is an underestimate of the true U in D. 
melanogaster because it does not take into account the recent 
bottleneck experienced by North America populations (see [48]) or 
the deleterious consequences of TE insertions within genes and 
regulatory sequences. Additionally, the elevated mutation rate 
observed in one line (line 33 [47,48]) is not a new trait developed 
during the mutation accumulation experiment and, therefore, 
such genotype was present in the initial natural population of D. 
melanogaster. We, thus, use U — 0.6 as a lower boundary for the true 
deleterious mutation rate across the euchromatic genome (in 
models M LowMut ). 

TEs are pervasive across the D. melanogaster genome [51—60]. 
Cridland et al (2013) recently reported a detailed analysis of TE 
abundance and distribution in D. melanogaster based on deep- 
coverage, whole-genome sequencing [60]. Their analysis of the 
Drosophila Synthetic Population Resource (DSPR; [129]) showed 
a total of 7,104 TE insertions, with 633 TEs inserted within exons. 
To include the deleterious consequences of TE insertion, 
therefore, we obtained an approximate mutation (insertion) rate 
(w TE ) based on mutation-selection balance predictions and the 
presence of TEs in genomic regions where they are most likely to 
cause deleterious effects (i.e., within exons). [We use TE presence 
in the DSPR lines instead of analyses based on the Drosophila 
Genetic Reference Panel (DGRP) due to the higher sequencing 
coverage in the DSPR lines, which increases the likelihood to 
detect and map TE insertions (see [60] for details)]. For non- 
recessive strongly deleterious mutations, mutation-selection bal- 
ance predicts a probability of segregation P seg that is the product of 
the equilibrium frequency (q) and the number of chromosomes 
analyzed («). P seg in exons (/bp) is 0.00003 (633 TEs/2 1,000,000 
exonic bp) and q = P seg /n = 2 x 10 6 . Based on the very low 
frequency of segregating TEs in natural populations (i.e., most TEs 
in this study were present in a single genome in agreement with 
previous analyses ofD. melanogaster populations; see [53,57,58]), we 
assume non-recessive deleterious effects and a heterozygous 
selection coefficient (s h ) against TE insertions within exons to be 
equal or greater than for amino acid mutations (5&0.0025), so Kte 
can be estimated by kte = 9 -*/« with kte— 5 x 10~ 9 /bp/generation. 
Thus, we obtain £/te~0.6, a result in agreement with previous 
analyses of TE transposition rate that suggested that the genomic 
rate of TE transposition is of the same order as the nucleotide 
spontaneous mutation rate [55], and suggests U~\.2 when 
including point mutations, small indels and TE insertions across 
the euchromatic regions of the D. melanogaster genome. 

Distribution of deleterious effects. The true probability 
distribution of selection coefficients against newly arising muta- 
tions is not known. Sensibly, no single deleterious distribution of 
fitness effects (DDFE) is likely to capture mutations at different 
genes or sites within genes. Different spatial and temporal 
conditions will, likely, further alter such DDFEs. As an approx- 
imation, we studied two DDFEs and the corresponding param- 
eters proposed for Drosophila to partially capture the potential 
effects of different DDFEs on our results and conclusions: a 
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gamma distribution [39-44] and a log-normal distribution 
[45,46], with the latter allowing to capture the presence of lethal 
mutations and fitting better D. melanogaster polymorphism data 
[45,46]. 

Following Loewe and Charlesworth [45], the log-normal DDFE 
has a median s of 0.000231 and standard deviation 5.308 for 
autosomes. A gamma DDFE for Dro.sophila is described by a shape 
parameter k= 0.3 and mean s of 0.0025 [33,38]. In both cases, we 
assumed a dominance coefficient h of 0.5. For models that include 
the contribution of TE insertions (t7=1.2), we further assumed 
that the DDFE of TEs is the same than of point mutations, and 
that the genomic sites where a TE insertion has deleterious effects 
are the same where point mutations and small indels have 
deleterious effects. Note that under the models that include a log- 
normal DDFE, the median s against individual TE insertions 
(0.000231) is equivalent to previous estimates in D. melanogaster of 
0.0002 [54]. 

An interesting attribute of the log-normal distribution is that it 
predicts a higher frequency of extreme values, including effectively 
neutral (s<s-y) and lethal (s>l) mutations, than those predicted by 
a gamma distribution. Because effectively neutral and strongly 
deleterious mutations play a minimal role removing linked 
variation [130], a log-normal DDFE is predicted to generate 
weaker BGS effects than a gamma DDFE with similar median s. 
Finally, it is worth noting that the parameters for these two DDFEs 
were obtained independently of a BGS model and therefore are 
not expected to maximize BGS as explanation of the observed 
patterns of diversity across the genome while, at the same time, 
may represent underestimates of the true magnitude of selection. 

BGS with crossover and gene conversion along D. 
melanogaster chromosomes. To estimate the recombination 
frequency between a focal neutral site and selected sites in BGS 
formulae, we used the whole-genome high-resolution recombina- 
tion maps from D. melanogaster described in [31]. These maps 
describe crossing over and gene conversion rates along chromo- 
some arms (except the small fourth chromosome) at 100-kb 
resolution. Following Langley et al ([131]; see also [45,132]), the 
total recombination frequency between a focal neutral site and a 
selected i-th site taking into account crossing over and gene 
conversion is: 

r = dr C0 +{lyL GC {\-e- d l L Gcy^ 

where d is the distance between the focal and the selected sites in 
bp, r C o is the rate of crossing over (/bp/female meiosis), y is the 
rate of gene conversion initiation (/bp/ female meiosis) and Lqc is 
the average gene conversion tract length. Based on [31], 
y = 1.25xl0~ 7 /bp/female meiosis and Z GC =518bp. In this 
study, BGS expectations were investigated using recombination 
rates caused by crossing over alone and with the more realistic 
recombination between sites based on the combined effect of 
crossing over and gene conversion. When not explicitly indicated, 
recombination using both crossing over and gene conversion is 
employed. 

BGS models and parameters 

We initially investigated eight BGS models using the formulas 
described above, with two DDFEs (log-normal or gamma; models 
M L n and M G , respectively), two deleterious mutations rates 
(U= 1.2 or 0.6; models M Stc iMut an d M LowMut , respectively), and 
two recombination scenarios (with only crossovers or crossovers 
and gene conversion events; models M GO an d M C o+gc> 
respectively). For instance, the full notation of a model that uses 



a log-normal DDFE, a deleterious mutation rate of [7=0.6, and 
recombination that only considers crossovers is M M Ll]wMutjC0 . 

To obtain the number of sites Naa, Nutr_ss and Nnc_ss for each 
1 -kb region, we followed the approach described by Charlesworth 
[33]. We assume 0.75 as the fraction of coding sites that alter 
amino acid sequences and correct this fraction by the proportion 
of constrained nonsynonymous sites (cs) to focus only on the 
fraction of deleterious mutations. In D. melanogaster, cs for amino 
acid sites is ~0.92 [22,37] and, thus, Naa for a regions is 
A«&k*0.75 x0.92. To obtain Mutr_ss, we correct the number of 
noncoding genie sites using the proportion of constrained sites 
(0.56 for introns and 0.81 for flanking UTRs [22]) and, 
equivalendy, we use the proportion of constrained sites at 
intergenic sequences (~0.5) [22,37] to obtain Nnc_ss. 

The deleterious mutation rate per bp (u) will be different for 
different models, due to the overall mutation rate and because 
different DDFEs predict a different fraction of deleterious 
mutations with s<si that will not contribute to BGS. The two 
mutation rates investigated here, U= 0.6 (M LowMut ) and 1.2 
(MstdMut), represent a neutral mutation rate of 4.2x10 9 and 
8.4xl0~ 9 /bp/generation. Assuming N e = 1,000,000 for D. mela- 
nogaster, the log-normal and gamma DDFEs described above 
predict 15.3 and 7.4% of deleterious mutations with s<s T , 
respectively. Therefore, the deleterious mutation rate relevant 
for BGS is 84.7 (M LN ) and 92.6% (M G ) that of the mutation rate. 

BGS models with two DDFEs. All previous BGS models 
assume a single DDFE for sites under selection. As a first 
approximation towards more realistic description of selection 
across the genome, we investigated two additional models that 
allow for two DDFEs, one for nonsynonymous mutations and the 
other for deleterious noncoding mutations (each DDFE following 
the parameters indicated above). The B landscapes generated by 
these hybrid models are intermediate relative to the other eight 
models: a model assuming a log-normal DDFE for nonsynon- 
ymous mutations and a gamma for deleterious noncoding 
mutations (M LN/G StdMut CO+GC ) shows a genome-wide median B 
of 0.469 whereas a model with a gamma DDFE for nonsynon- 
ymous mutations and a log-normal for deleterious noncoding 
mutations (M G/L N j stdMut,co+Gc) shows a median B of 0.459. 
Spearman's rank p between estimates of B across the genome from 
these 2-DDFE models and the other eight single-DDFE models 
range between 0.974 and 0.998 for M m/G;StdMut;C0 . +G C) an d 
between 0.946 and 0.986 for M G /LN,stdMut,co+GC- 

X chromosome. For the X chromosome, we have assumed 
equivalent deleterious mutations rates than for autosomes and 
adjusted the distribution of selection coefficients to sx=2/3 s A 
(assuming equal number of breeding males and females in the 
population and genie selection, A = 0.5). Estimates of the ratio of 
observed neutral diversity for the X and autosomes (X/A) also 
assume equal number of males and females, with X/ 
A = 0.75 x7? x /^A. Equivalendy, St for the X chromosome varies 
from that for autosomes, with s-r for the X chromosome now 
satisfying M e x sx~ 1 • For the X chromosome, the log-normal and 
gamma DDFEs predict 18.6 and 9.1% of deleterious mutations 
with s<St, respectively (both percentages higher that for 
autosomes, see above). Finally, we allowed for an effective 
difference in recombination rates due to the lack of meiotic 
recombination in Drosophila males, with 0.5r and 0.66r for 
autosomes and the X chromosomes, respectively. For each of 
the BGS models under study, we obtained specific solutions to the 
integrals described above along a continuum of recombination 
rates for the X chromosome and autosomes separately. 

Complete and trimmed chromosomes. All analyses were 
carried out based on the study of complete chromosome arms as 
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well as after removing euchromatic regions near to centromeres 
and telomeres with evident reduction in crossover rates [133]. 
Following [26] we use the term trimmed chromosomes or trimmed genome 
to designate genomic regions after removing sub-centromeric and 
-telomeric regions with strongly reduced crossover rates. Sub- 
centromeric regions with reduced crossover rates were assigned by 
starting at the centromere and moving into the chromosome arm 
until a minimum of 3 consecutive 100-kb windows showed 
crossover rates > 1 cM/Mb. Sub-telomeric regions with evident 
reduction in crossover rates were assigned in an equivalent 
manner, by starting at the telomere and moving towards the center 
of the chromosome until a minimum of 3 consecutive 100-kb 
windows showed >1 cM/Mb. 

Estimates of neutral nucleotide diversity across the D. 
melanogaster genome 

Nucleotide diversity across the D. melanogaster genome was 
estimated from a sub-Saharan African population (Rwanda, RG, 
population of the Drosophila Population Genetics Project, DPGP; 
www.dpgp.org/ and [62]). D. melanogaster is thought to have 
originated in sub-Saharan Africa, and eastern Africa — including 
Rwanda — in particular [134]. Our use of the RG population, 
therefore, minimizes the non-equilibrium effects caused by recent 
expansion observed in western Africa and non-African D. 
melanogaster populations [62,134]. Additionally, the RG population 
combines a relatively large sample (n = 2 7), and low and well 
characterized levels of admixture [62] . We followed Pool et al. 
(2012) and used updated assemblies from the DPGP2.v3 site (see 
http://www.dpgp.org/dpgp2/DPGP2.html and [62] for details). 
We analyzed these assemblies with, 1) sites putatively heterozygous 
or with quality value smaller than Q31 masked to 'N', and 2) 
putatively admixed regions of African genomes filtered to "N" 
based on the description of admixed regions from [62] (also 
available from http://www.dpgp.org/dpgp2/DPGP2.html). Fi- 
nally, we only investigated non-N sites that were present in a 
minimum of 10 sequences. Equivalent results were obtained when 
the analyses were restricted to sites with a minimum of 15 
sequences, or after removing regions showing excess admixture as 
well as regions showing excess long-range identity-by-descent 
(IBD) [62] (data not shown). 

Neutral (silent) diversity was estimated as pairwise nucleotide 
variation per site at intergenic sequences and introns (t^u) and 
following the approach described in [31]. In short, we first 
annotated all gene models, transposable elements and repetitive 
sequences onto the reference sequence allowing for overlapping 
annotation. Intergenic sites correspond to sites between gene 
models (excluding annotated UTRs), transposable elements or 
repetitive sequences. Intronic sites are analyzed as such only when 
they never overlap with another annotation (e.g., with alternatively 
spliced exons or other elements within introns). After this filtering 
approach, intergenic and intronic sites show similar levels of silent 
diversity, with a very weak tendency for 7t s a in introns 
(tintrons = 0.0085) to be higher than in intergenic (itmter- 
geni<: = 0.0082) sites (Sign test, P= 0.034 and P= 0.056, along 
autosomes and X chromosome, respectively). 

Unless indicated otherwise, diversity analyses were performed 
using all 100-kb non-overlapping windows across the genome 
whereas analyses of 1 0-kb and 1 -kb regions were limited to regions 
with more than 1,000 and 500 silent sites, respectively. Inferences 
about the frequency of segregating variants within the population 
were based on estimates of Tajima's D [69] after normalizing by 
Ztoin {D/Drtaa) following Schaeffer (2002) [135]. Equivalent 
conclusions were obtained when using DGRP sequenced strains 



[136] from a North American natural population (Raleigh, NC, 
USA). 

Outliers of nucleotide diversity. To parameterize depar- 
tures of observed levels of diversity relative to the levels 
predicted by B, we applied a generalized regression model (GRM), 
and obtained regression residuals 7t s u.R. In particular, we obtained 
studentized residuals to detect outlying values and obtain 
associated probabilities. Studentized residuals are equivalent to 
standardized residuals (the ratio of the residual to its standard 
error) and have a Student's t distribution, but in the case of 
studentized residuals the data point or observation under study 
(possibly an outlier) is removed from the analysis of the standard 
error. Because the sample size in our analyses is very large and the 
fraction of outiiers small, studentized and standardized residuals 
are almost equivalent and generate the same outiler regions at the 
three physical scales analyzed. We applied GRM separately to 
autosomes and X chromosome to prevent including assumptions 
about the true X/A ratio. To correct for multiple tests, we applied 
the Benjamini and Hochberg (1995) method with FDR = 0.10. 

Estimates of rates of protein evolution 

Genes and sequence alignments. We obtained CDS 
sequence alignments in the 5 species of the D. melanogaster 
subgroup (D. melanogaster, D. simulans, D. sechellia, D. yakuba and 
D. erecta) based on multiple-species alignments available from the 
UCSC Genome Browser (http:/ /hgdownload.cse. ucsc.edu/ 
goldenPath/ dm3/ multiz 1 5way/ alignments/ flyBaseGene.exonNuc. 
fa.gz) and current gene annotation and genomic location in D. 
melanogaster (http://flybase.org; D. melanogaster annotation release 
5.47, 10/9/2012). We removed genes with coding sequences in D. 
melanogaster shorter than 450 bp or with fewer than 100 amino 
acids positions present in all five species after alignment. We also 
removed genes presenting premature stop codons in at least one 
species relative to the stop codon position in D. melanogaster. Finally, 
for genes with multiple transcript forms we chose the CDS 
alignment corresponding to the longest transcript. In total, we 
investigated rates of evolution in 6,876 protein-coding genes. 

Rates of protein evolution. We obtained d^ (the number of 
nonsynonymous substitutions per nonsynonymous site), d§ (the 
number of synonymous substitutions per synonymous site)and CO 
(the ratio d^/ d$) for each gene. To this end, we used the program 
codeml as implemented in PAML v 4.5 [137,138] and applied a 
branch model that allowed different co in all internal and external 
branches of the five-species tree. We focused on co along the D. 
melanogaster branch after split from the D. simulans lineage to 
investigate recent patterns of efficacy of selection on amino acid 
changes and the possible association with BGS effects across the 
genome based on recombination rates estimated in D. melanogaster. 

We followed Larracuente et al. (2008) [84] to prevent 
combining genes with amino acids evolving under positive 
selection and genes with most amino acids evolving under a 
nearly neutral scenario, either purifying selection or neutrality. 
We, therefore, applied codeml and compared Model Mia (nearly 
neutral evolution) against a model that allows the additional 
presence of positive selection at a fraction of sites (model M2a). In 
particular, we compared maximum likelihood estimates (MLEs) 
under these two models and applied a likelihood ratio tests (LRTs) 
[139,140] to detect differences between the models. A total of 125 
genes showed evidence of positive selection in the D. melanogaster 
lineage after applying the Benjamini and Hochberg (1995) method 
to correct for multiple tests (FDR = 0.05). To further reduce the 
incidence of genes under positive selection and/or recent 
pseudogenization we removed genes showing co greater than 
0.75 (note that the median co in the D. melanogaster lineage is 
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0.0696). In all, we investigated 6,677 genes with no evidence of 
positive selection or drastic reduction in constrains along the D. 
melanogaster lineage. 

Finally, we took into account the consequences of variable 
recombination and B on the efficacy of selection on synonymous 
mutations. In D. melanogaster, synonymous mutation are under weak 
selection ([113] and references therein) and therefore a reduction in 
efficacy of selection is predicted to increase ds, possibly biasing the 
direct use of the ratio d N /d s . On the other hand, the inclusion of d & 
corrects for possible differences in coalescent times and ancestral 
polymorphism across the chromosome. We then applied generalized 
linear models, GRM, and obtained regression residuals of ft) along 
the D. melanogaster 'lineage after controlling for d s ((Br) and used 0)r as 
an estimate of variable efficacy of selection on amino acid changes. 
Equivalent results are obtained when we used regression residuals of 
</n along the D. melanogaster lineage after controlling for 4; (aW-r) as an 
estimate of variable efficacy of selection on amino acid changes. We 
obtained regression residuals of ft) or i N for autosomes and the X 
chromosome separately. 

Forward computer simulations and estimates of the 
proportion of adaptive substitutions 

We used the program SLIM [94] to capture the consequences of 
temporal changes in recombination rates on estimates of i. 
Simulations followed a panmictic population of 10,000 diploid 
individuals (jV) and a chromosome segment of 1 Mb that contained 
1 00 protein encoding genes evenly distributed, one every 1 0 kb. Each 
10-kb region included a typical Drosophila gene: a 1,000 bp 5' UTR, a 
first short 300-bp exon, a 1,000 bp first intron, two additional 600 bp 
exons, a short 200-bp internal intron, and a 300 bp 3'-UTR, 
followed by a 5,000 bp intergenic sequence. Mutations were assigned 
to have the same parameters than those in the BGS models described 
above, with two mutation types: neutral and deleterious. The 
population mutation rate was set to Nu = 0.005 and deleterious 
mutations were assumed to follow a gamma DDFE (h = 0.5) with 
average Ns= —2,500 [33,38]. The proportion of deleterious muta- 
tions at the different genomic elements was set to 0.92 at first and 
second codon positions, 0.81 at UTRs, 0.56 at introns, and 0.5 at 
intergenic sites [22,37]; all third codon positions evolved neutrally. 
Note that the simulation of a smaller population would prevent the 
study of deleterious mutations with the appropriate scaled selection 
for Drosophila (M= —2,500) while the simulation of shorter genomic 
sequences would severely underestimate BGS effects in regions with 
non-reduced recombination. 

Simulations followed 10 independent populations a minimum 
of 50 M generations after reaching equilibrium (>10 JV 
generations). Recombination rates were uniformly distributed, 
with a population crossover rate (/bp) of Mco — 0.04 and 
Nr co = 0.0025 for periods of high and low recombination, 
respectively. To prevent overestimating BGS effects we also 
included a constant rate of gene conversion initiation (/bp) of 
My — 0.05 and average gene conversion tract length of Lqc = 518 
[31]. After the initial lOjV generations, we sampled the population 
every 0. IN generations, obtained polymorphism data from 20 
randomly drawn chromosomes, and compared them to a 
sequence that evolved independently for 20JV generations to 
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